library(dplyr)
data_folder <- paste(getwd(), '/../datasets/', sep='')
Opening dataset
dfpurorig <- read.csv(paste(data_folder,"res_purchase_2014.csv",sep=''), sep=',',dec='.')
dfpurorig$Transaction.Date<-as.POSIXct(strptime(as.character(dfpurorig$Transaction.Date), "%m/%d/%Y"))
dfpurorig$Posted.Date<-as.POSIXct(strptime(as.character(dfpurorig$Posted.Date), "%m/%d/%Y"))
dfpurorig$Description<-as.factor(toupper(dfpurorig$Description))
dfpurorig$Cardholder.Last.Name <-as.factor(toupper(dfpurorig$Cardholder.Last.Name))
dfpurorig$Agency.Name <-as.factor(toupper(dfpurorig$Agency.Name))
dfpurorig$Vendor <-as.factor(toupper(dfpurorig$Vendor))
dfpurorig$Cardholder.First.Initial <- as.factor(toupper(dfpurorig$Cardholder.First.Initial))
dfpurorig$Merchant.Category.Code..MCC. <- as.factor(toupper(dfpurorig$Merchant.Category.Code..MCC.))
dfpur<-dfpurorig
Cleaning dataset
a<-as.data.frame(as.numeric(as.character(dfpur$Amount)))
names(a)<-'value'
dfpur[is.na(a$value),'Amount']
[1] ($29.99) $572.27 $12.90 452.91 zero
90449 Levels: -0.01 0.01 -0.02 0.02 0.03 -0.04 0.04 -0.05 0.05 -0.06 0.06 ... 999.99
The field Amount includes some wrong numeric values… let us correct it.
dfpur$Amount<-as.numeric(as.character(dfpur$Amount))
dfpur[is.na(a$value),'Amount']<-c(29.99,572.27,12.90,452.91)
Verifying all fields.
summary(dfpur)
Year.Month Agency.Number Agency.Name
Min. : -999 Min. : 1000 OKLAHOMA STATE UNIVERSITY :115995
1st Qu.:201309 1st Qu.: 1000 UNIVERSITY OF OKLAHOMA : 76143
Median :201401 Median :47700 UNIV. OF OKLA. HEALTH SCIENCES CENTER: 58247
Mean :201090 Mean :42786 DEPARTMENT OF CORRECTIONS : 22322
3rd Qu.:201404 3rd Qu.:76000 DEPARTMENT OF TOURISM AND RECREATION : 17232
Max. :201900 Max. :98000 DEPARTMENT OF TRANSPORTATION : 15689
(Other) :136829
Cardholder.Last.Name Cardholder.First.Initial
JOURNEY HOUSE TRAVEL INC: 10137 J : 55031
UNIVERSITY AMERICAN : 7219 G : 42251
JOURNEY HOUSE TRAVEL : 4693 D : 38120
HEUSEL : 4212 M : 35401
CARDHOLDER : 3789 S : 35081
HINES : 3423 C : 33213
(Other) :408984 (Other):203360
Description Amount
GENERAL PURCHASE :247186 Min. : -42863.0
AIR TRAVEL : 29584 1st Qu.: 30.9
ROOM CHARGES : 18120 Median : 104.9
AT&T SERVICE PAYMENT ITM : 2657 Mean : 425.0
001 PRIORITY 1LB PCE: 2005 3rd Qu.: 345.0
0 : 1828 Max. :1903858.4
(Other) :141077
Vendor Transaction.Date
STAPLES : 14842 Min. :2013-04-17 00:00:00
AMAZON MKTPLACE PMTS : 12197 1st Qu.:2013-09-25 00:00:00
WW GRAINGER : 12076 Median :2014-01-06 00:00:00
AMAZON.COM : 10766 Mean :2013-12-28 12:36:37
BILL WARREN OFFICE PRODUC: 4479 3rd Qu.:2014-04-02 00:00:00
LOWES #00241 : 4231 Max. :2014-06-30 00:00:00
(Other) :383866
Posted.Date
Min. :2013-07-01 00:00:00
1st Qu.:2013-09-26 00:00:00
Median :2014-01-07 00:00:00
Mean :2013-12-30 09:39:08
3rd Qu.:2014-04-03 00:00:00
Max. :2014-06-30 00:00:00
Merchant.Category.Code..MCC.
STATIONERY, OFFICE SUPPLIES, PRINTING AND WRITING PAPER: 24860
BOOK STORES : 21981
INDUSTRIAL SUPPLIES NOT ELSEWHERE CLASSIFIED : 21668
DENTAL/LABORATORY/MEDICAL/OPHTHALMIC HOSP EQIP AND SUP.: 20183
GROCERY STORES,AND SUPERMARKETS : 17152
MISCELLANEOUS AND SPECIALTY RETAIL STORES : 13335
(Other) :323278
Year.month field seems to include wrong “-999” values.
table(dfpur$Year.Month)
-999 201307 201308 201309 201310 201311 201312 201401 201402 201403 201404
586 37635 39314 38762 40266 34275 26969 37230 35830 37720 39249
201405 201406 201900
36022 37955 644
Also “201900” seems a mistake.
Checking if year.months values can be generated through Transaction.date or Posted.Date fields.
Checking consistence of Posted.Date and Transaction.Date. It means verifying if all Posted.Date is equal or after Transaction.Date…
count(dfpur[dfpur$Posted.Date<dfpur$Transaction.Date,])
a<-dfpur[dfpur$Year.Month!=format(dfpur$Posted.Date,'%Y%m'),]
nrow(a)
[1] 1230
a<-dfpur[dfpur$Year.Month!=format(dfpur$Transaction.Date,'%Y%m'),]
nrow(a)
[1] 24771
We can assume that Posted.Date is better than Transaction.Date to regenerate Year.month field. Also, it is possible to say that we have two main wrong values on Year.Month field: “-999” and “201900”.
Updating Year.Month based on Posted.Date field.
dfpur$Year.Month<-as.factor(format(dfpur$Posted.Date,'%Y%m'))
table(dfpur$Year.Month)
201307 201308 201309 201310 201311 201312 201401 201402 201403 201404 201405
37635 39314 38762 40266 34275 26969 37230 35830 38188 39249 36784
201406
37955
Checking how Amount values are distributed…
summary(dfpur$Amount)
Min. 1st Qu. Median Mean 3rd Qu. Max.
-42863.0 30.9 104.9 425.0 345.0 1903858.4
It seems that there are some outliers after the 75o percentile. Checking top 20 amounts.
(dfpur%>%select(Vendor, Transaction.Date, Amount)%>%filter(Amount>359)%>%top_n(20))%>%arrange(desc(Amount))
There are some suspect values for a vendor named “PAYMENT ADJUSTMENT”. Exploring “PAYMENT ADJUSTMENT” vendor registers.
(a<-dfpur%>%select(Vendor, Transaction.Date, Amount)%>%filter(Vendor=='PAYMENT ADJUSTMENT'))
It is clear that these registers are not real purchases.
sum(a$Amount)
[1] 4061761
As you can see, $4,061,761 are related to this “PAYMENT ADJUSTMENT” vendor. Also, we realized some negative amounts. Checking for negative numbers…
(b<-dfpur%>%select(Vendor, Transaction.Date, Amount)%>%filter(Amount<=0))%>%arrange(Amount)
sum(b$Amount)
[1] -3562604
The total of Negative numbers is -$3,562,604.
As we do not have any instruction related to “PAYMENT ADJUSTMENTS” and negative amounts, we will not use these amounts for answering questions.
Removing these registers…
dfpur<-subset(dfpur,Vendor!='PAYMENT ADJUSTMENT')
dfpur<-subset(dfpur,Amount>0)
Reducing the name of the field “Merchant.Category.Code..MCC.” to “Merchant.Category”
names(dfpur)[11]<-'Merchant.Category'
Checking for top 10 descriptions.
group_by(dfpur,Description)%>%summarize(c=n())%>%top_n(10)%>%arrange(desc(c))
Checking for other strange descriptions.
head(dfpur%>%group_by(Description)%>%summarize(c=n()),20)
There are several registers with strange descriptions (e.g.,"",0,0000000000, etc.). However, the amounts and other data are correct. We will keep these registers.
What is the total amount of spending captured in this dataset?
sum(dfpur$Amount)
[1] 187541509
How much was spent at WW GRAINGER?
a<-dfpur%>%select(Vendor,Posted.Date,Description,Amount)%>%filter(Vendor=='WW GRAINGER')%>%arrange(Posted.Date)
sum(a$Amount)
[1] 5225095
How much was spent at WM SUPERCENTER?
a<-dfpur%>%select(Vendor,Posted.Date,Description,Amount)%>%filter(Vendor=='WM SUPERCENTER')%>%arrange(Posted.Date)
sum(a$Amount)
[1] 31777.83
What is the standard deviation of the total monthly spending in the dataset?
(a<-group_by(dfpur,Year.Month)%>%summarize(mean=mean(Amount),sd=sd(Amount),count=n()))
sd(a$mean)
[1] 25.22905
Describe the process you would follow to build a model on this dataset to make predictions about the stock market.
What biases might this dataset have if you tried to use it to model equities?
It is important to consider that this dataset includes information on the purchase/billing perspective. Of course, to be assertive for modeling equities, other perspectives (kind of information) are very important and must be avaiable (e.g., costs, cash flow, balance Sheet, income statement, etc.).